I decided to rename “count” to “total_rent” to avoid confusion in histogram plots.
setwd("~/Dropbox/MOOCs/Nanodegree/03_eda/Giulio Ministeri")
dts <- read.delim('train.csv', sep = ",")
library(plyr)
dts <-rename(dts, replace=c("count" = "total_rent"))
dim(dts)
## [1] 10886 12
names(dts)
## [1] "datetime" "season" "holiday" "workingday" "weather"
## [6] "temp" "atemp" "humidity" "windspeed" "casual"
## [11] "registered" "total_rent"
summary(dts)
## datetime season holiday
## 2011-01-01 00:00:00: 1 Min. :1.000 Min. :0.00000
## 2011-01-01 01:00:00: 1 1st Qu.:2.000 1st Qu.:0.00000
## 2011-01-01 02:00:00: 1 Median :3.000 Median :0.00000
## 2011-01-01 03:00:00: 1 Mean :2.507 Mean :0.02857
## 2011-01-01 04:00:00: 1 3rd Qu.:4.000 3rd Qu.:0.00000
## 2011-01-01 05:00:00: 1 Max. :4.000 Max. :1.00000
## (Other) :10880
## workingday weather temp atemp
## Min. :0.0000 Min. :1.000 Min. : 0.82 Min. : 0.76
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:13.94 1st Qu.:16.66
## Median :1.0000 Median :1.000 Median :20.50 Median :24.24
## Mean :0.6809 Mean :1.418 Mean :20.23 Mean :23.66
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:26.24 3rd Qu.:31.06
## Max. :1.0000 Max. :4.000 Max. :41.00 Max. :45.45
##
## humidity windspeed casual registered
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 47.00 1st Qu.: 7.002 1st Qu.: 4.00 1st Qu.: 36.0
## Median : 62.00 Median :12.998 Median : 17.00 Median :118.0
## Mean : 61.89 Mean :12.799 Mean : 36.02 Mean :155.6
## 3rd Qu.: 77.00 3rd Qu.:16.998 3rd Qu.: 49.00 3rd Qu.:222.0
## Max. :100.00 Max. :56.997 Max. :367.00 Max. :886.0
##
## total_rent
## Min. : 1.0
## 1st Qu.: 42.0
## Median :145.0
## Mean :191.6
## 3rd Qu.:284.0
## Max. :977.0
##
First, I noticed that the “datetime” feature packs too much information into a single value, so I split it into four new features: hour, day, month and year (minutes and seconds are always equal to 0).
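A minimal sketch of this extraction on two made-up rows, using base R's `as.POSIXlt`. The data frame name `toy` is hypothetical, and the one-hour shift (so that hours run from 1 to 24 instead of 0 to 23) is an assumption inferred from the summary shown afterwards, not necessarily the exact code I ran.

```r
# Two made-up timestamps in the same format as the "datetime" column
toy <- data.frame(
  datetime = c("2011-01-01 00:00:00", "2012-07-19 23:00:00"),
  stringsAsFactors = FALSE
)

# Parse the strings into a list-like date-time object
ts <- as.POSIXlt(toy$datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

toy$hour  <- ts$hour + 1            # assumption: shift 0..23 to 1..24
toy$day   <- ts$mday                # day of the month
toy$month <- ts$mon + 1             # POSIXlt months are 0-based
toy$year  <- factor(ts$year + 1900) # POSIXlt years count from 1900

toy$datetime <- NULL                # drop the original timestamp
print(toy)
```

Treating “year” as a factor matches the way it is summarised as two counts rather than as a numeric range.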
After the extraction:
## season holiday workingday weather
## Min. :1.000 Min. :0.00000 Min. :0.0000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:1.000
## Median :3.000 Median :0.00000 Median :1.0000 Median :1.000
## Mean :2.507 Mean :0.02857 Mean :0.6809 Mean :1.418
## 3rd Qu.:4.000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:2.000
## Max. :4.000 Max. :1.00000 Max. :1.0000 Max. :4.000
## temp atemp humidity windspeed
## Min. : 0.82 Min. : 0.76 Min. : 0.00 Min. : 0.000
## 1st Qu.:13.94 1st Qu.:16.66 1st Qu.: 47.00 1st Qu.: 7.002
## Median :20.50 Median :24.24 Median : 62.00 Median :12.998
## Mean :20.23 Mean :23.66 Mean : 61.89 Mean :12.799
## 3rd Qu.:26.24 3rd Qu.:31.06 3rd Qu.: 77.00 3rd Qu.:16.998
## Max. :41.00 Max. :45.45 Max. :100.00 Max. :56.997
## casual registered total_rent hour
## Min. : 0.00 Min. : 0.0 Min. : 1.0 Min. : 1.00
## 1st Qu.: 4.00 1st Qu.: 36.0 1st Qu.: 42.0 1st Qu.: 7.00
## Median : 17.00 Median :118.0 Median :145.0 Median :13.00
## Mean : 36.02 Mean :155.6 Mean :191.6 Mean :12.54
## 3rd Qu.: 49.00 3rd Qu.:222.0 3rd Qu.:284.0 3rd Qu.:19.00
## Max. :367.00 Max. :886.0 Max. :977.0 Max. :24.00
## day month year
## Min. : 1.000 Min. : 1.000 2011:5422
## 1st Qu.: 5.000 1st Qu.: 4.000 2012:5464
## Median :10.000 Median : 7.000
## Mean : 9.993 Mean : 6.521
## 3rd Qu.:15.000 3rd Qu.:10.000
## Max. :19.000 Max. :12.000
## season holiday workingday weather temp atemp humidity windspeed casual
## 1 1 0 0 1 9.84 14.395 81 0.0000 3
## 2 1 0 0 1 9.02 13.635 80 0.0000 8
## 3 1 0 0 1 9.02 13.635 80 0.0000 5
## 4 1 0 0 1 9.84 14.395 75 0.0000 3
## 5 1 0 0 1 9.84 14.395 75 0.0000 0
## 6 1 0 0 2 9.84 12.880 75 6.0032 0
## registered total_rent hour day month year
## 1 13 16 1 1 1 2011
## 2 32 40 2 1 1 2011
## 3 27 32 3 1 1 2011
## 4 10 13 4 1 1 2011
## 5 1 1 5 1 1 2011
## 6 1 1 6 1 1 2011
Looking at the column “season” I noticed two things:
Of course, holidays are very rare and working days are the majority. Temperature is warm and humidity is high; in fact, the felt temperature (“atemp”) is about 3 to 5 degrees higher than the measured one. Wind is also a constant presence, but never annoying: looking at the quartiles (and assuming wind is measured in knots), three quarters of the observations have a wind speed below 17 knots. I am no expert on wind speeds, but I think that at these speeds wind can be considered almost a breeze.
As a first step I want to take a look at data about weather and climate.
## 1 2 3 4
## 2686 2733 2733 2734
As already noticed, the data are almost equally distributed among seasons.
## 1 2 3 4
## 7192 2834 859 1
It is a very sunny place: very few observations show rain or bad weather.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.82 13.94 20.50 20.23 26.24 41.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.76 16.66 24.24 23.66 31.06 45.46
Temperature does not seem to be a problem for bikers either: very few observations show prohibitive temperatures. The temperature distribution has no sharply defined shape, but I think I can say it is unimodal and close to a Gaussian.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 47.00 62.00 61.89 77.00 100.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 7.002 13.000 12.800 17.000 57.000
A very sunny and warm place, with a lot of humidity and frequent low-speed wind: it seems we are in a city near the coast. I am not from the U.S. and I am not a climate expert, but I think Washington D.C. can be considered a coastal city, or at least a city where the influence of the sea on the local climate is strong.
For the sake of completeness, before diving into the bike rental numbers, I plot the histograms of holidays and working days.
## 0 1
## 10575 311
## 0 1
## 3474 7412
As expected, most of the days are working days and very few are holidays. Let me check whether these two variables are mutually exclusive, i.e. whether a working day is never a holiday and vice versa. To do so I compute the cardinality of two sets made from the entries that satisfy the conditions “workingday and holiday” and “not workingday and not holiday”.
## [1] 0 15
## [1] 132 15
Holidays are not working days (as expected), but not all non-working days are holidays: probably weekends are considered neither working days nor holidays. A quick check against a 2011 calendar confirms this hypothesis.
## season holiday workingday weather temp atemp humidity windspeed
## 12 1 0 0 1 14.76 16.665 81 19.0012
## 35 1 0 0 2 14.76 16.665 71 16.9979
## 173 1 0 0 2 8.20 9.090 69 26.0027
## casual registered total_rent hour day month year
## 12 26 30 56 12 1 1 2011
## 35 16 54 70 12 2 1 2011
## 173 2 60 62 12 8 1 2011
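The exclusivity check described above can be sketched on a few made-up rows as follows. The flags in `toy` are illustrative values, not taken from the dataset, and the variable names `both`, `neither` and `xt` are hypothetical.

```r
# Five made-up days: two working days, one holiday, two weekend days
toy <- data.frame(
  workingday = c(1, 1, 0, 0, 0),
  holiday    = c(0, 0, 1, 0, 0)
)

# Rows flagged as both working day and holiday (expected: none)
both <- sum(toy$workingday == 1 & toy$holiday == 1)

# Rows that are neither working day nor holiday (candidate weekends)
neither <- sum(toy$workingday == 0 & toy$holiday == 0)

# A cross-tabulation shows all four combinations at once
xt <- table(workingday = toy$workingday, holiday = toy$holiday)
print(xt)
```

A zero in the cell where both flags are 1 means the two variables never overlap; a non-zero count where both are 0 shows that they do not cover all days between them.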
The added features about time:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 455 454 448 433 442 452 455 455 455 455 455 455 456 456 456 456 456 456
## 19 20 21 22 23 24
## 456 456 456 456 456 456
Some missing entries in the early hours of the day.
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 575 573 573 574 575 572 574 574 575 572 568 573 574 574 574 574 575 563
## 19
## 574
In practice the missing points are so few that I can consider the dataset to span the 19 days evenly.
## 1 2 3 4 5 6 7 8 9 10 11 12
## 884 901 901 909 912 912 912 912 909 911 911 912
The same consideration holds for “month”. The missing points are probably due to outages of the service that collected the data.
## 2011 2012
## 5422 5464
The dataset covers 2011 and 2012 almost equally. Within each year, it covers the first 19 days of every month and every hour of the day (1 to 24 in the “hour” feature).
Now the bike rental data:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 42.0 145.0 191.6 284.0 977.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 36.0 118.0 155.6 222.0 886.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 4.00 17.00 36.02 49.00 367.00
which I think would be clearer on a logarithmic scale.
Total and registered rentals have almost the same distribution, while the casual rentals distribution is quite different, with typical values almost an order of magnitude smaller than those of registered rentals (median 17 against 118). The total and registered rentals seem to have a bimodal distribution, probably a combination of two Gaussian distributions, while the casual rentals look like an exponential distribution. I want to check the weight of casual riders in the total count. Here I compute the ratio between casual rentals and total rentals, of course only for entries where the total is not zero.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.06169 0.14470 0.17100 0.25340 1.00000
Most of the time the casual riders are a small part of “total_rent”; very often there are no casual riders at all.
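A small sketch of this ratio on made-up counts; `toy` and `ratio` are hypothetical names, and the guard against a zero total is kept even though it may never trigger on the real data.

```r
# Made-up hourly counts for four observations
toy <- data.frame(
  casual     = c(0, 3, 8, 5),
  registered = c(1, 13, 32, 0)
)
toy$total_rent <- toy$casual + toy$registered

# Keep only rows with a non-zero total to avoid dividing by zero
nonzero <- toy$total_rent > 0
ratio <- toy$casual[nonzero] / toy$total_rent[nonzero]

print(summary(ratio))
```

Note that the mean of these per-hour ratios is not the same quantity as total casual rentals divided by total rentals; the summary above reports the former.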
The dataset has 10886 rows and 12 columns: datetime, season, holiday, workingday, weather, temp, atemp, humidity, windspeed, casual rentals, registered rentals, total rentals. Five features (datetime, season, holiday, workingday, weather) are categorical variables, while the others can be considered continuous variables.
The dataset spans the years 2011 and 2012; the data come from the first 19 days of every month of those years.
The climate data suggest we are in a city near the coast where the temperature is warm for most of the year and wind is a constant presence.
Most rentals come from registered users, while casual rentals are a small fraction: on average they are 17% of the total.
To be sure of not leaving out a feature, I plot a quick graph about
Of course the most important features of the dataset are the ones about rentals. The goal here is to determine which of the other features are best for predicting rental counts. After this first review I think that “workingday” and “hour” will be two of the most effective features for prediction. Among the climate features, “weather” and “temp” will probably be the most useful.
In my opinion “month” or “season” could help identify casual rentals made by tourists, and windspeed probably influences the rental count as well. On the contrary, I think that “year” and “day” will be useless or very ineffective; “atemp” and “humidity” should also be of little use, since they repeat information carried by other features.
Yes, I did. The original dataset encoded the hour of the day, day of the month, month and year in one single timestamp variable called “datetime”. I split this feature into four new features and removed the timestamp. Minutes and seconds have been discarded since they are always 0.
Yes, I transformed the rentals data to a logarithmic scale. The raw distributions are strongly skewed towards 0, while the log-transformed distributions appear to be bimodal for “registered” and “total_rent”, with two peaks at around 8 and 200. The casual rentals have no well-defined distribution, even if it resembles an exponential one.
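To illustrate the effect of the transformation, here is a sketch on simulated counts. The data are synthetic (a mixture of two Poisson samples centred at 8 and 200, chosen only to mimic the two peaks), and the `+ 1` offset is my assumption for keeping zero counts finite; it is not necessarily how the plots in this report were produced.

```r
set.seed(1)

# Synthetic right-skewed counts: a low mode, a high mode and two zeros
counts <- c(rpois(200, 8), rpois(200, 200), 0, 0)

# log10(x + 1) keeps zero counts finite (assumed offset)
log_counts <- log10(counts + 1)

# Histogram on the log scale, computed without drawing it
h <- hist(log_counts, breaks = 20, plot = FALSE)
print(range(log_counts))
```

On the raw scale the low mode is squeezed against zero; on the log scale the two groups sit roughly one unit apart and both become visible.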
I think that from now on I can ignore the “total_rent” feature and analyze “registered” and “casual” separately. The total rentals count is always the sum of the other two rent types, so if good predictive models can be found for registered and casual rentals, a separate model for “total_rent” would be unnecessary.
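The claim that the total is always the sum of the two rental types is easy to verify. The sketch below uses the six rows printed earlier in this report; on the full data the same comparison would run over the whole `dts` data frame.

```r
# The first six rows shown earlier in the report
toy <- data.frame(
  casual     = c(3, 8, 5, 3, 0, 0),
  registered = c(13, 32, 27, 10, 1, 1),
  total_rent = c(16, 40, 32, 13, 1, 1)
)

# TRUE only if every row satisfies total = casual + registered
is_sum <- all(toy$total_rent == toy$casual + toy$registered)
print(is_sum)
```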
As a first step I analyze the weather data:
## dts$season: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.82 9.02 12.30 12.53 16.40 29.52
## --------------------------------------------------------
## dts$season: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.84 18.86 22.96 22.82 26.24 38.54
## --------------------------------------------------------
## dts$season: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.58 26.24 28.70 28.79 31.16 41.00
## --------------------------------------------------------
## dts$season: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.74 13.12 16.40 16.65 20.50 30.34
These histograms confirm the mistake in season labeling: both the temperature distributions and the bad weather counts show that the labels given by the competition authors are wrong. Season 1 must be winter and season 3 is surely summer. Looking at the number of rainy days, I would guess that season 2 is spring and season 4 is fall.
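The reasoning above (coldest season is winter, warmest is summer) can be sketched as follows. The temperatures in `toy` are a handful of illustrative values, and the factor labels encode my guess about the seasons, not the labels published with the competition.

```r
# A few illustrative temperatures per season code
toy <- data.frame(
  season = c(1, 1, 2, 2, 3, 3, 4, 4),
  temp   = c(9.84, 12.30, 22.96, 18.86, 28.70, 31.16, 16.40, 13.12)
)

# Mean temperature per season code
mean_temp <- tapply(toy$temp, toy$season, mean)

coldest <- names(which.min(mean_temp))  # candidate for winter
warmest <- names(which.max(mean_temp))  # candidate for summer

# Relabel the codes according to the guess made in the text
toy$season_name <- factor(toy$season, levels = 1:4,
                          labels = c("winter", "spring", "summer", "fall"))
print(mean_temp)
```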
## weather temp atemp humidity windspeed
## weather 1.000000000 -0.05503542 -0.05537597 0.40624365 0.007261124
## temp -0.055035418 1.00000000 0.98494811 -0.06494877 -0.017852010
## atemp -0.055375973 0.98494811 1.00000000 -0.04353571 -0.057473002
## humidity 0.406243651 -0.06494877 -0.04353571 1.00000000 -0.318606992
## windspeed 0.007261124 -0.01785201 -0.05747300 -0.31860699 1.000000000
## registered -0.109340372 0.31857128 0.31463539 -0.26545787 0.091051662
## registered
## weather -0.10934037
## temp 0.31857128
## atemp 0.31463539
## humidity -0.26545787
## windspeed 0.09105166
## registered 1.00000000
## weather temp atemp humidity windspeed
## weather 1.000000000 -0.05503542 -0.05537597 0.40624365 0.007261124
## temp -0.055035418 1.00000000 0.98494811 -0.06494877 -0.017852010
## atemp -0.055375973 0.98494811 1.00000000 -0.04353571 -0.057473002
## humidity 0.406243651 -0.06494877 -0.04353571 1.00000000 -0.318606992
## windspeed 0.007261124 -0.01785201 -0.05747300 -0.31860699 1.000000000
## casual -0.135917680 0.46709706 0.46206654 -0.34818690 0.092276189
## casual
## weather -0.13591768
## temp 0.46709706
## atemp 0.46206654
## humidity -0.34818690
## windspeed 0.09227619
## casual 1.00000000
The response variable most influenced by temperature is “casual” rentals (correlation of about 0.47 with “temp”), while for registered rentals the influence is still present but less noticeable (about 0.32).
“temp” and “atemp” are very strongly correlated (about 0.98); their relation is almost deterministic.
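Correlation tables like the ones above come from `cor()` applied to the selected columns. The sketch below uses simulated data: the linear link between `temp` and `atemp` and the noise levels are arbitrary assumptions, chosen only to reproduce the qualitative picture of one near-deterministic pair and one moderate correlation.

```r
set.seed(7)

# Simulated features: atemp is almost a linear function of temp
temp   <- runif(200, 0, 40)
atemp  <- 1.1 * temp + rnorm(200, sd = 1)
casual <- pmax(0, round(2 * temp + rnorm(200, sd = 30)))

sim <- data.frame(temp = temp, atemp = atemp, casual = casual)

# Pearson correlation matrix of all column pairs
cm <- cor(sim)
print(round(cm, 3))
```

When two predictors are this strongly correlated, keeping both adds almost no information, which is why “atemp” is a candidate for removal.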
Even if temperature appears to affect registered rentals through a linear or even an exponential function, I think that it works more like an indicator function: below 10° rentals are tightly concentrated near 0, but as temperature rises above 10° the distribution spreads almost uniformly from 0 to 800. For casual rentals the exponential relation seems to be stronger, but the behavior is almost the same; the threshold value is probably a little higher.
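One way to look at this threshold behaviour is to compare the spread of rentals on the two sides of the cut-off. The sketch below simulates data that follow the indicator-like pattern described above (the 10° threshold and the 0 to 800 range come from the text; the upper bound of 20 below the threshold and everything else are assumptions for illustration).

```r
set.seed(42)

# Simulated temperatures and rentals following an indicator-like pattern
temp  <- runif(500, 0, 40)
upper <- ifelse(temp < 10, 20, 800)         # tight near zero when cold
registered <- round(runif(500, 0, upper))

# Split observations at the assumed 10 degree threshold
band <- cut(temp, breaks = c(-Inf, 10, Inf),
            labels = c("below 10", "10 and above"), right = FALSE)

# Range (max minus min) of rentals within each band
spread <- tapply(registered, band, function(x) diff(range(x)))
print(spread)
```

If temperature acted linearly, the mean would shift steadily with it; here it is the spread that jumps once the threshold is crossed.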
Looking at the correlation coefficient, the influence of weather on rentals is noticeable for both types of rentals (about -0.11 for registered and -0.14 for casual). Looking at the box plot, it seems that the weather affects mainly the variance of the response variable, and only to a small extent its average value. As for “temp”, I think that the “weather” feature works as a discriminant variable too: in case of bad weather we can confidently predict a low number of rentals; on the contrary, in case of sunny weather our confidence decreases because many more (and maybe unknown) variables come into play. Put simply: if it is raining, very few people will take a bike, and blindly saying that no one will take one will not be far from the truth; if it is sunny, instead, people are free to decide whether to go for a ride or not, and we need many more features to estimate the number of rentals.
A look at the humidity data:
As expected, humidity is correlated with weather (about 0.41): the worse the weather, the higher the humidity level.
As for “weather”, windspeed influences rentals almost only through the variance. In case of very strong wind, the rentals distribution tightens towards zero. Like “temp”, windspeed does not seem to have a linear relation with rentals, but rather acts as a step function with a threshold at around 35 to 38 knots. There is a strange hole in the windspeed data, as if the wind never had speeds around 5 knots; one possible reason could lie in the equipment that measures the windspeed, which may need a minimum windspeed to move away from zero.
## workingday holiday hour day
## workingday 1.000000000 -0.2504913912 0.0027802345 0.009829427
## holiday -0.250491391 1.0000000000 -0.0003541611 -0.015877453
## hour 0.002780234 -0.0003541611 1.0000000000 0.001132431
## day 0.009829427 -0.0158774525 0.0011324308 1.000000000
## month -0.003394394 0.0017314575 -0.0068176434 0.001973618
## season -0.008126058 0.0293676097 -0.0065456741 0.001728863
## registered 0.119459851 -0.0209556729 0.3805397262 0.019110637
## month season registered
## workingday -0.003394394 -0.008126058 0.11945985
## holiday 0.001731457 0.029367610 -0.02095567
## hour -0.006817643 -0.006545674 0.38053973
## day 0.001973618 0.001728863 0.01911064
## month 1.000000000 0.971523800 0.16945110
## season 0.971523800 1.000000000 0.16401053
## registered 0.169451102 0.164010534 1.00000000
## workingday holiday hour day
## workingday 1.000000000 -0.2504913912 0.0027802345 0.009829427
## holiday -0.250491391 1.0000000000 -0.0003541611 -0.015877453
## hour 0.002780234 -0.0003541611 1.0000000000 0.001132431
## day 0.009829427 -0.0158774525 0.0011324308 1.000000000
## month -0.003394394 0.0017314575 -0.0068176434 0.001973618
## season -0.008126058 0.0293676097 -0.0065456741 0.001728863
## casual -0.319110963 0.0437989287 0.3020454019 0.014108702
## month season casual
## workingday -0.003394394 -0.008126058 -0.31911096
## holiday 0.001731457 0.029367610 0.04379893
## hour -0.006817643 -0.006545674 0.30204540
## day 0.001973618 0.001728863 0.01410870
## month 1.000000000 0.971523800 0.09272204
## season 0.971523800 1.000000000 0.09675806
## casual 0.092722039 0.096758063 1.00000000
The behavior of registered and casual rentals as a function of the hour is very different. The distribution of registered rentals is bimodal, with two peaks at 9 am and at 6 pm. For casual rentals there is no sharp peak, but the average number of rentals increases around midday.
The influence of “workingday” is also strong but totally different for the two types of rentals: registered rentals are higher on working days, while for casual rentals the behavior is exactly the opposite.
The “month” and “season” features appear to be correlated with both types of rentals, and in the same way. I need to investigate more, but I think that “month” is in practice a repetition of the information about season. The main conclusion from these plots is that during the warmer seasons (spring and summer) rentals increase, while as we go towards winter rentals decrease.
Surprisingly, “year” also has an impact on the number of rentals: in 2012 the average number of rentals is higher for both casual and registered rentals.
These behaviors seem reasonable: registered rentals are made by working people who probably need a bike to go to work; on the contrary, casual rentals are made by people who decide at the last minute to go for a ride during the weekend. So for registered rentals, being a working day makes the rentals go up (positive correlation, about 0.12), while for casual rentals being a non-working day increases the rentals (negative correlation, about -0.32).
The “hour” feature supports this hypothesis: registered rentals are made around working hours, with the two peaks at about 8 am and 6 pm, while for casual rentals the peak is in the central hours of the day.
What makes me think that the “humidity” correlation does not reflect reality is the correlation between “humidity” itself and “weather”, and between “weather” and the response variables. I think that the correlation between “humidity” and the response variables is driven by the “weather” feature, whose own correlation with the response variables is hidden by its categorical nature. Second, I cannot believe that people check a portable hygrometer before renting a bike.
Casual rentals are common in spring, summer and fall, while registered rentals occur more frequently only in spring and summer, probably because fall and winter days are less stable in terms of weather, and hence people are less confident when they have to book a bike in advance.
The strongest correlation I saw is between the “hour” of the day and rentals. A linear model is not the most appropriate way to explain it: being at hour 1 or at hour 24 does not change much the willingness of people to rent a bike. Applying a non-linear transformation to “hour”, one that accounts, for example, for the time from and/or to the rush hours, could increase the correlation with the rentals count.
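One possible transformation of this kind is the circular distance, in hours, to the nearest rush hour. This is only a sketch: the function name `hours_to_rush` is hypothetical, and the rush hours 9 and 18 (on the 1 to 24 scale) are an assumption taken from the peaks observed above.

```r
# Distance in hours from each value to the nearest rush hour,
# wrapping around midnight so that hour 24 and hour 1 are neighbours
hours_to_rush <- function(hour, rush = c(9, 18)) {
  sapply(hour, function(h) {
    d <- abs(h - rush)       # plain distance to each rush hour
    min(pmin(d, 24 - d))     # shortest way around the 24-hour clock
  })
}

print(hours_to_rush(c(1, 9, 13, 24)))
```

With this feature, hours 24 and 1 get almost the same value (6 and 7), unlike the raw “hour” where they sit at opposite ends of the scale.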
The many samples with very low rental values make it difficult to find a clear relation with the predictive variables.
From these scatter plots it is hard to understand which of the two features, “weather” and “humidity”, has more influence on the response variables. Even though the relationship between weather and humidity is clear, the relationship between weather and/or humidity and the response variables is not.
Rentals behave differently as a function of “temp” depending on the “season”. For both rental types there is a strange peak in winter when the temperature is above 27°; and even at the same temperature, people behave differently depending on whether it is spring or fall.
As a matter of fact, “month” is not simply a repetition of the “season” information: the distributions of casual and registered rentals do not change only between seasons, there are also small changes between the months of the same season. Especially for winter and fall months, there is a remarkable difference from month to month. In conclusion, considering only spring and summer, “month” and “season” behave almost the same and I would adopt the simpler feature and discard “month”; but considering winter and fall, I reconsider my previous opinion and decide that it is more reasonable to remove “season” and add “month” to the predictive set.
Once more, the response variables as a function of “hour” depend on whether the day is a working day or not.
Now I’d like to investigate the behavior of rentals during the year, as a function of the hour.
In these heat maps the number of rentals is represented as a color gradient: the redder the cell, the more numerous the rentals. The x axis represents the hours of the day and the y axis the months of the year. Apart from the winter months, registered rentals on working days mainly come from working people who rent bikes to go to work. There is a small peak in August, probably due to tourists, and a small drop in October that I cannot explain.
Casual rentals on non-working days have a peak in April. I guess it is people’s reaction to the first sunny days of the coming spring: a Sunday bike ride.
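The grid behind a heat map like these is just the mean number of rentals for each month and hour combination. A minimal sketch with made-up rows follows; the values in `toy` are illustrative and `grid` is a hypothetical name.

```r
# Made-up observations for two months and two hours
toy <- data.frame(
  month      = c(1, 1, 1, 7, 7, 7),
  hour       = c(9, 9, 13, 9, 13, 13),
  registered = c(100, 120, 30, 300, 80, 100)
)

# Mean rentals per (month, hour) cell: rows are months, columns are hours
grid <- tapply(toy$registered,
               list(month = toy$month, hour = toy$hour),
               mean)
print(grid)
```

The resulting matrix can then be passed to a plotting function such as base R's `image()` to draw the colored cells.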
In these plots a specific zone, which I called the “comfortable zone”, can be identified in the lower right of the plot. In this zone the number of riders is the highest. Note that the zone is not precisely in the lower right corner, but a bit more towards the centre: especially for casual rentals, too high a temperature is uncomfortable too. For “registered”, higher temperatures do not affect the number of rentals as much as for “casual”, maybe because those who decide to rent a bike in advance (by registering) do not take temperature into account, only weather.
“hour” and “workingday” are the most meaningful features for both response variables.
The behavior of the two types of rentals is completely different in terms of these two features. For casual rentals, a non-working day is synonymous with higher rentals on average, while for registered rentals it is exactly the opposite. Rental trends take different shapes during the day: for casual rentals there is one broad peak during the central hours of the day, while for registered rentals there are two sharp peaks, in the morning at 8 am and in the evening at 6 pm, i.e. when people go to and come back from work.
Climate-related features, as expected, are more or less correlated with each other and with the “season” index. “humidity” and “weather” are clearly correlated with each other (about 0.41); “temp” and “atemp” are also strongly correlated, and their relation is almost deterministic.
The two distributions of registered and casual bike rentals are strongly skewed towards zero. The problem lies in the dataset which, in my opinion, is too detailed. My analysis shows that there is no clear correlation between a feature (or a set of features) and the occurrences of zero values; instead, zeros are spread across the different situations described by the provided features: whether the weather is nice or bad, whether it is winter or summer, whether it is 9 am or 11 pm, zeros are always present, of course with different frequencies. In my opinion, if some data aggregation were applied, it would be possible to remove these zeros and at the same time preserve the nature of the dataset and the phenomena it describes.
Put simply, it is like looking at sand through a microscope: it is very difficult to decide which color the single grains are; it would be better to assess the color of a handful of sand.
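As an illustration of the aggregation idea, the sketch below sums made-up hourly counts into daily totals and compares the number of zeros before and after. Daily sums are just one possible choice of granularity, used here as an assumption; `toy` and `daily` are hypothetical names.

```r
# Made-up hourly casual rentals for two days
toy <- data.frame(
  year   = 2011,
  month  = 1,
  day    = rep(1:2, each = 4),
  hour   = rep(c(3, 4, 9, 18), times = 2),
  casual = c(0, 0, 5, 7, 0, 1, 0, 9)
)

# Sum the hourly counts within each calendar day
daily <- aggregate(casual ~ year + month + day, data = toy, FUN = sum)

zeros_hourly <- sum(toy$casual == 0)   # zeros at hourly resolution
zeros_daily  <- sum(daily$casual == 0) # zeros after aggregation
print(daily)
```

In this toy case half of the hourly rows are zero while no daily total is, which is the "handful of sand" effect described above.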
The seasonal trend of rentals, shown month by month. As expected, both registered and casual rentals increase as spring starts, reach their highest peak during summer, decrease in fall and have the lowest count during winter.
Casual rentals are more influenced by the summer season: the distribution of the average counts has a bell shape centered in July, probably because casual rentals are driven by the summer vacations of residents, or maybe by tourists, who are probably not used to registering for rentals.
The distribution drawn by the average number of registered rentals has a very wide bell shape whose peak is very hard to identify. My idea is that registered rentals are driven by residents who use the bike in everyday life to avoid traffic, so the trend reflects the hypothesis that residents use the bike as long as the weather allows it and stop renting when the winter climate becomes prohibitive for bikers.
This graph of the rentals distribution per hour, split by the “workingday” feature, convinces me once more that my hypothesis about who is driving the rentals is right. Registered and casual rentals have totally different behaviors, both in terms of hours and of whether the day is a working day or not.
For registered rentals the biggest rental counts happen during the rush hours of working days, i.e. at 9 am and at 6 pm, when people have to go to work or come back home. For casual rentals the peak is around midday on non-working days, when people go out for a ride to enjoy the city on a sunny day.
The bike rental dataset is composed of 10886 rows with 12 columns. The response features represent the number of rentals, divided by type: “registered” rentals for users who registered their request, and “casual” rentals for users who just pick up the bike. The “count” feature (renamed to “total_rent”) simply represents the sum of the two. The predictive features can be split in two sets: time-related features and climate-related features. Time-related features include year, month, day, hour and additional information such as whether the day is a working day or a holiday. Climate-related features include humidity, temperature, windspeed and weather conditions.
I started by refactoring the features a bit, to split the information and represent it in a simpler and easier-to-use way. I then explored the variables one by one to understand the granularity of the dataset and the nature of the different variables. My first impression was that the dataset is very detailed: each row describes the situation in a single hour of a single day of the year. Of course, with this level of detail, it is common to find that variables describing human activities contain a lot of zero values.
My first thought was that climate-related variables would be very useful to predict bike rentals; the elementary deduction was that to have a bicycle ride you need nice and warm weather. At the same time I noticed that the provided features were many and that some redundancy might be present. I approached the climate-related data looking for the most valuable features to predict the response variables, by checking whether some kind of linear relation exists among them.
Unfortunately I found that there is no strong linear relation, and that the weather and temperature conditions mainly affect the variability of the rentals. In case of bad weather, or too cold or too hot temperatures, the rental counts are pushed toward zero; but when the conditions for a bike ride are good enough, the rentals do not take a precise or proportional value: instead, the variability of the rental counts increases and the gap between minimum and maximum values widens enormously.
After these considerations, I started to think that the changes seen in the climate analysis were only partially due to climate itself, and were probably a consequence of the weather changes due to seasons. My reasoning was right and, as shown in plot 2, “month” is the most significant feature to track changes in rental behavior on a year-based analysis.
Last but not least, I analyzed how rentals change within the day. I left this analysis for last because I knew for sure that hour would be the most valid feature to predict rentals. The analysis led me to conclude that “registered” and “casual” rentals behave in totally different ways: the combination of “hour” and “workingday” shows that registered and casual rental counts are driven by people who have totally different objectives when renting a bike.
Anyway, building a model to predict rentals would require a lot more work. The “hour” and “month” features should be reworked to better represent human behavior. “month”, for example, should be replaced with a feature that describes the location of the month within the seasons and within its own season, as a decimal number for example. The “hour” feature should be aggregated into bigger time slots to remove the many zero-valued entries in the rental counts.
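A rough sketch of what these two reworked features could look like. Both helper names (`season_position`, `hour_slot`) are hypothetical; the mapping of months to seasons in blocks of three and the slot boundaries are assumptions made for illustration, not choices validated on the data.

```r
# Decimal position of a month: integer part is the season (1..4),
# fractional part is how far into that season the month falls.
# Assumes seasons are consecutive blocks of three months.
season_position <- function(month) {
  season <- (month - 1) %/% 3 + 1
  season + ((month - 1) %% 3) / 3
}

# Group hours (1..24) into five coarser slots; boundaries are assumed
hour_slot <- function(hour) {
  cut(hour,
      breaks = c(0, 6, 10, 16, 20, 24),
      labels = c("night", "morning rush", "midday",
                 "evening rush", "late"))
}

print(season_position(c(1, 5, 12)))
print(hour_slot(c(1, 9, 13, 18, 24)))
```

Rental counts could then be summed within each slot, which should leave far fewer zero-valued entries than the hourly data.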